Abstract

Hyperlinks and other relations in Wikipedia are an extraordinary
resource which is still not fully understood. In this paper we study
the different types of links in Wikipedia, and contrast the use of the
full graph with respect to just direct links. We apply a well-known
random walk algorithm on two tasks, word relatedness and named-entity
disambiguation. We show that using the full graph is more effective
than just direct links by a large margin, that non-reciprocal links
harm performance, and that there is no benefit from categories and
infoboxes, with coherent results on both tasks. We set new
state-of-the-art figures for systems based on Wikipedia links,
comparable to systems exploiting several information sources and/or
supervised machine learning. Our approach is open source, with
instructions to reproduce results, and amenable to integration with
complementary text-based methods.

1 Introduction 

Hyperlinks and other relations between concepts and instances in
Wikipedia have been successfully used in semantic tasks (Milne and
Witten, 2013). Still, many questions about the best way to leverage
those links remain unanswered. For instance, methods using direct
hyperlinks alone would wrongly disambiguate Lions in Figure 1 to
B&I_Lions, a rugby team from Britain and Ireland, as it shares two
direct links to potential referents in the context (Darrel Fletcher, a
British football player, and Cape Town, the city where the team
suffered some memorable defeats), while Highveld_Lions, a cricket team
from South Africa, has only one. When considering the whole graph of
hyperlinks we find that the cricket team is related to two cricketers
named Alan Kourie and Duncan Fletcher and could thus pick the right
entity for Lions in this context. In this paper we will study this and
other questions about the use of hyperlinks in word relatedness
(Gabrilovich and Markovitch, 2007) and named-entity disambiguation,
NED (Hachey et al., 2012).


Figure 1: Simplified example motivating the use of the full graph. It
shows the disambiguation of Lions in “Alan Kourie, CEO of the Lions
franchise, had discussions with Fletcher in Cape Town”. Each mention is
linked to the candidate entities by arrows, e.g. B&I_Lions and
Highveld_Lions for Lions. Solid lines correspond to direct hyperlinks
and dashed lines to a path of several links. An algorithm using direct
links alone would incorrectly output B&I_Lions, while one using the
full graph would correctly choose Highveld_Lions.

Previous work in this area has typically focused on novel algorithms
which work on a specific mix of resource, information source, task and
test dataset (cf. Sect. 7). In the case of NED, the evaluation of the
disambiguation component is confounded by interactions with mention
spotting and candidate generation. With very few exceptions, there is
little analysis of components and alternatives, and it is very
difficult to learn any insight beyond the fact that the mix under study
attained certain performance on the target dataset¹. The number of
algorithms and datasets is growing by the day, with no well-established
single benchmark, and the fact that some systems are developed on test
data, coupled with reproducibility problems (Fokkens et al., 2013, on
word relatedness), makes it very difficult to know where the area
stands. There is a need for clear points of reference which allow us to
understand where each information source and algorithm stands with
respect to other alternatives.

We thus depart from previous work, seeking to set such a point of
reference, and focus on a single knowledge source (hyperlinks in
Wikipedia) with a clear research objective: given a well-established
random walk algorithm (Personalized PageRank (Haveliwala, 2002)) we
explore sources of links and filtering methods, and contrast the use of
the full graph with respect to using just direct links. We follow a
clear development/test/analysis methodology, evaluating on an extensive
range of both relatedness and NED datasets. The results are confirmed
in both tasks, yielding more support to the findings in this research.
All software and data are publicly available, with instructions to
obtain out-of-the-box replicability².

The contributions of our research are the following: (1) We show for
the first time that performing random walks over the full graph is
preferable to considering only direct links. (2) We study several
sources of links, showing that non-reciprocal links hurt and that the
contribution of the category structure and links in infoboxes is
residual. (3) We set the new state of the art for systems based on
Wikipedia links for both word relatedness and named-entity
disambiguation. The results are close to the best systems to date,
which use several information sources and/or supervised machine
learning techniques, and specialize in either relatedness or
disambiguation.

1 See (Hachey et al., 2012) and (Garcia et al., 2014) for two
exceptions on NED. The first is limited to a single dataset, the
second explores methods based on direct links, which we extend
to using the full graph.

2 http://ixa2.si.ehu.es/ukb/README.wiki.txt


Our work shows that a careful analysis of varieties of 
graphs using a well-known random walk algorithm 
pays off more than most ad-hoc algorithms. 

The article is structured as follows. We first present previous work,
followed by the different options to build hyperlink graphs. Sect. 4
reviews random walks for relatedness and NED. Sect. 5 sets the
experimental methodology, followed by the analysis and results on
development data (Sect. 6) and the comparison to the state of the art
(Sect. 7). Finally, Sect. 8 draws the conclusions.

2 Previous work 

The emergence of Wikipedia has opened up enormous opportunities for
natural language processing (Hovy et al., 2013), with many derived
knowledge bases, including DBpedia (Bizer et al., 2009), Freebase
(Bollacker et al., 2008), and BabelNet (Navigli and Ponzetto, 2012a),
to name a few. These resources have been successfully used on semantic
processing tasks like word relatedness, named-entity disambiguation
(NED), also known as entity linking, and the closely related
Wikification. Broadly speaking, Wikipedia-based approaches to those
tasks can be split between those using the text in the articles (e.g.,
Gabrilovich and Markovitch, 2007) and those using the links between
articles (e.g., Guo et al., 2011).

Relatedness systems take two words and return a high number if the two
words are similar or closely related³ (e.g. professor - student), and a
low number otherwise (e.g. professor - cucumber). Evaluation is
performed by comparing the returned values to those assigned by humans
(Rubenstein and Goodenough, 1965).

In NED (Hachey et al., 2012) the input is a mention of a named entity
in context and the output is the appropriate instance from Wikipedia,
DBpedia or Freebase (cf. Figure 1). Wikification is similar (Mihalcea
and Csomai, 2007), but target terms include common nouns and only
relevant terms are disambiguated. Note that the disambiguation
component in Wikification and NED can be the same.

Our work focuses on relatedness and NED. We favored NED over
Wikification because of the larger number of systems and evaluation
datasets, but our conclusions are applicable to Wikification, as well
as other Wikipedia-derived resources.

3 Relatedness is more general than similarity. For the sake of
simplicity, we will talk about relatedness in this paper.

In this section we will focus on previous work using Wikipedia links
for relatedness, NED and Wikification. Although relatedness and
disambiguation are closely related (relatedness to context terms is an
important disambiguation clue for NED), most of the systems are
evaluated in either relatedness or NED, with few exceptions, like
WikiMiner (Milne and Witten, 2013), KORE (Hoffart et al., 2012) and the
one presented in this paper.

Milne and Witten (2008a) are the first to use hyperlinks between
articles for relatedness. They compare two articles according to the
number of incoming links that they have in common (i.e. overlap of
direct links) based on Normalized Google Distance (NGD), combined with
several heuristics and collocation strength. In later work (Milne and
Witten, 2013), they incorporated machine learning. The authors also
apply their technique to NED (Milne and Witten, 2008b), using their
relatedness measures to train a supervised classifier. Unfortunately
they do not present results of their link-based method alone, so we
decided to reimplement it (cf. Sect. 6). We show that, under the same
conditions, using the full graph is more effective in both tasks. We
also run their out-of-the-box system⁴ on the same datasets as ours
(cf. Sect. 7), with results below ours.

Apart from hyperlinks between articles, other works on relatedness use
the category structure (Strube and Ponzetto, 2006; Ponzetto and Strube,
2007; Ponzetto and Strube, 2011) to run path-based relatedness
algorithms which had been successful on WordNet (Pedersen et al.,
2004), or use relations in infoboxes (Nastase and Strube, 2013). In all
cases, they obtain performance figures well below hyperlink-based
systems (cf. Sect. 7). We will explore the contribution of such
relations (cf. Sect. 3), incorporating them into the hyperlink graph.

Attempts to use the whole graph of hyperlinks for relatedness have been
reported before. Yeh et al. (2009) obtained very low results on
relatedness using an algorithm based on random walks similar to ours.
Similar in spirit, Yazdani and Popescu-Belis (2013) built a graph
derived from the Freebase Wikipedia Extraction dataset, which is
derived from, but richer than, Wikipedia. Even if they mix hyperlinks
with textual similarity, their results are lower than ours. One of the
key differences with these systems is that we remove non-reciprocal
links (cf. Sect. 3).

4 https://sourceforge.net/projects/wikipedia-miner/

Regarding link-based methods for NED, there is only one system which
relies exclusively on hyperlinks. Guo et al. (2011) use direct
hyperlinks between the target entity and the mentions in the context,
counting the number of such links. We show that the use of the full
graph produces better results.

The rest of the NED systems present complex combinations. Lehmann et
al. (2010) present a supervised system combining features based on
hyperlinks, categories, text similarity and relations from infoboxes.
Despite their complex and rich system, we will show that they perform
worse than our system. Hachey et al. (2011) explored hyperlinks beyond
direct links for NED, building subgraphs for each context using paths
of length two departing from the context terms, combined with
text-based relatedness. We will show that the full graph is more
effective than limiting the distance to two, and report better results
than their system. Several authors have included direct links using the
aforementioned NGD in their combined systems (Ratinov et al., 2011;
Hoffart et al., 2011). Unfortunately, they do not report separate
results for the NGD component. In very recent work Garcia et al. (2014)
compare NGD with several other algorithms using direct links, but do
not explore the full graph, or try to characterize links. We will see
that their results are well below ours (cf. Sect. 7).

Graph-based algorithms for relatedness and disambiguation have been
successfully used on other resources, particularly WordNet. Hughes and
Ramage (2007) were the first to present a random walk algorithm over
the WordNet graph. Agirre et al. (2010) improved over their results
using a similar random walk algorithm on several variations of WordNet
relations, reporting the best results to date among WordNet-based
algorithms. The same algorithm was used for word sense disambiguation
(Agirre et al., 2014), also reporting state-of-the-art results. We use
the same open source software in our experiments. As an alternative to
random walks, Tsatsaronis et al. (2010) use a path-based system over
the WordNet relation graph.

In more recent work (Navigli and Ponzetto, 2012b; Pilehvar et al.,
2013), the authors present two relatedness algorithms for BabelNet, an
enriched version of WordNet including articles from Wikipedia,
hyperlinks and cross-lingual relations from non-English Wikipedias. In
related work, Moro et al. (2014) present a multi-step NED algorithm on
BabelNet, building semantic graphs for each context. We will show that
Wikipedia hyperlinks alone are able to provide similar performance on
both tasks.

3 Building Wikipedia Graphs 

Wikipedia pages can be classified into main articles, category pages,
redirects and disambiguation pages. Given a Wikipedia dump (a snapshot
from April 4, 2013), we mine links between articles, between articles
and category pages, as well as the links between category pages (the
category structure). Our graphs include a directed edge from one
article to another iff the text of the first article contains a
hyperlink to the second article. In addition, we also include
hyperlinks in infoboxes.

The graph contains two types of nodes (articles and categories) and
three types of directed edges: hyperlinks from article to article (H),
infobox links from article to article (I), links from article to
category and links from category to category (C).

We constructed several graphs using different combinations of nodes and
edges. In addition to the directed versions (D) we also constructed an
undirected version (U), and a reduced graph which only contains links
which are reciprocal (R), that is, we add a pair of edges between a1
and a2 if and only if there exists a hyperlink from a1 to a2 and from
a2 to a1. Reciprocal links capture the intuition that both articles are
relevant to each other, and tackle issues with links to low relevance
articles, e.g. links to articles on specific years like 1984. Some
authors weight links according to their relevance (Milne and Witten,
2013). Our heuristic to keep only reciprocal links can be seen as a
simpler, yet effective, method to avoid low relevance links.
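
The three graph variants just described (D, U, R) can be illustrated
with a minimal Python sketch. The function name `build_variants` and
the toy link list are hypothetical and only serve the illustration;
this is not the extraction code used in the experiments.

```python
def build_variants(links):
    """Return the directed (D), undirected (U) and reciprocal (R) edge sets.

    links: iterable of (source_article, target_article) hyperlinks.
    """
    directed = set(links)
    # U: every hyperlink is modeled as two directed edges.
    undirected = directed | {(b, a) for a, b in directed}
    # R: keep an edge only if the reverse hyperlink also exists.
    reciprocal = {(a, b) for a, b in directed if (b, a) in directed}
    return directed, undirected, reciprocal


# Toy example: the link to the year article is not reciprocated.
toy_links = [
    ("Highveld_Lions", "Alan_Kourie"),
    ("Alan_Kourie", "Highveld_Lions"),
    ("Alan_Kourie", "1984"),
]
d_edges, u_edges, r_edges = build_variants(toy_links)
```

In this toy case the one-way link to 1984 survives in D and U but is
dropped from R, which is the filtering effect the heuristic aims for.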

Table 1 gives the number of nodes and edges in some selected graphs.
The graph with the fewest edges is the one with reciprocal hyperlinks,
Hr, and the graphs with the most edges are those with undirected edges,
as each edge is modeled as two directed edges⁵. The number of nodes is
similar in all, except for the infobox graphs (infoboxes are only
available for a few articles), and the reciprocal graph Hr, as
relatively few nodes have reciprocal edges.

5 This was done in order to combine undirected and reciprocal edges,
and could be avoided in other cases.


Graph     Edges       Nodes     RG        TAC09_200
Cd        18.803K     4.873K    51.1†‡    49.5†‡
Cu        37.598K     4.873K    72.9†‡    65.5†‡
Id        6.572K      1.860K    43.1†‡    57.0†‡
Iu        12.692K     1.860K    52.8†‡    65.5†‡
Hd        90.674K     4.103K    75.1†‡    65.0†‡
Hu        165.258K    4.103K    76.6‡     66.0†‡
Hr        16.338K     2.955K    88.4      68.5
HrCu      53.005K     4.898K    78.2‡     67.5‡
HrIu      26.394K     3.273K    82.9‡     68.0‡
HrCuIu    63.184K     4.900K    75.6†‡    67.5‡

Table 1: Statistics for selected graphs and results on development
data for relatedness (RG, Spearman) and NED (TAC09_200, accuracy) with
default parameters (see text). See Sect. 4.1 for abbreviations. † for
stat. significant differences with Hr in either RG or TAC09_200. ‡ for
stat. signif. when comparing on all relatedness or NED datasets.


Article              Freq.    Prob.
Gotham_City          32       0.38
Gotham_(magazine)    15       0.18
New_York_City        1        0.01
Gotham_Records       1        0.01

Table 2: Partial view of dictionary entry for “gotham”. The
probability is calculated as the ratio between the frequency and the
total count.


Drink                          Alcohol
Drink                 .124     Alcohol               .145
Alcoholic_beverage    .036     Alcoholic_beverage    .026
Drinking              .028     Ethanol               .018
Coffee                .020     Alkene                .006
Tea                   .017     Alcoholism            .006

Table 3: Sample of the probability distribution returned by PPR for
two words. Top five articles shown.

3.1 Building the dictionary

In order to link running text to the articles in the graph, we use a
dictionary, i.e., a static association between string mentions and all
possible articles each mention can refer to.

We built our dictionary from the same Wikipedia dump, using article
titles, redirections, disambiguation pages, and anchor text. Mention
strings are lowercased and all text between parentheses is removed. If
an anchor links to a disambiguation page, the text is associated with
all possible articles the disambiguation page points to. Each
association between a mention and an article is scored with the prior
probability, estimated as the number of times that the mention occurs
as an anchor for that article divided by the total number of
occurrences of the mention as an anchor. Note that our dictionary can
disambiguate any mention, just returning the highest-scoring article.
Table 2 partially shows a sample entry in our dictionary.
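
As an illustration of the prior estimation described above, the
following Python sketch builds a dictionary from (anchor text, target
article) pairs. It covers anchors only; titles, redirects and
disambiguation pages are left out. The names `normalize`,
`build_dictionary` and `most_likely`, as well as the toy counts, are
assumptions made for this example and do not come from the released
software.

```python
import re
from collections import Counter, defaultdict


def normalize(mention):
    """Lowercase the mention and remove any text between parentheses."""
    return re.sub(r"\s*\([^)]*\)", "", mention).lower().strip()


def build_dictionary(anchors):
    """anchors: iterable of (anchor_text, target_article) pairs."""
    counts = defaultdict(Counter)
    for text, article in anchors:
        counts[normalize(text)][article] += 1
    dictionary = {}
    for mention, per_article in counts.items():
        total = sum(per_article.values())
        # Prior = anchor frequency for the article / total anchor count.
        dictionary[mention] = {a: c / total for a, c in per_article.items()}
    return dictionary


def most_likely(dictionary, mention):
    """Return the highest-prior article for a mention, or None if unknown."""
    entry = dictionary.get(normalize(mention))
    return max(entry, key=entry.get) if entry else None


# Toy counts (hypothetical, not the real figures of Table 2).
toy_anchors = [("Gotham", "Gotham_City")] * 3 + [("gotham", "Gotham_(magazine)")]
toy_dict = build_dictionary(toy_anchors)
best = most_likely(toy_dict, "Gotham (comics)")
```

The `most_likely` helper corresponds to the remark that the dictionary
alone can disambiguate a mention by returning its highest-scoring
article.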

4 Random Walks 

The PageRank random walk algorithm (Brin and Page, 1998) is a method
for ranking the vertices in a graph according to their relative
structural importance. PageRank can be viewed as the result of a random
walk process, where the final rank of node i represents the probability
of a random walk over the graph ending on node i, at a sufficiently
large time.

Personalized PageRank (PPR) is a variation of PageRank (Haveliwala,
2002), where the query of the user defines the importance of each node,
biasing the resulting PageRank score to prefer nodes in the vicinity of
the query nodes. The query bias is also called the teleport vector. PPR
has been successfully used on the WordNet graph for relatedness (Hughes
and Ramage, 2007; Agirre et al., 2010) and WSD (Agirre and Soroa, 2009;
Agirre et al., 2014). In our experiments we use UKB version 2.1⁶, an
open source software package for relatedness and disambiguation based
on PPR. For the sake of space, we will skip the details, and refer the
reader to those papers. PPR has two parameters: the number of
iterations, and the damping factor, which controls the relative weight
of the teleport vector.
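
To make the roles of the two parameters concrete, a minimal power
iteration for PPR can be sketched as follows. This is an illustrative
sketch and not the UKB implementation: the function name, the
adjacency-list representation and the treatment of nodes without
outgoing edges (their mass is returned to the teleport distribution)
are assumptions.

```python
def personalized_pagerank(out_links, teleport, damping=0.85, iterations=30):
    """Approximate PPR by power iteration.

    out_links: dict mapping a node to the list of its successor nodes.
    teleport:  dict mapping a node to its teleport probability (sums to 1).
    """
    nodes = set(out_links) | set(teleport)
    for successors in out_links.values():
        nodes.update(successors)
    rank = {n: teleport.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to the teleport vector.
        nxt = {n: (1.0 - damping) * teleport.get(n, 0.0) for n in nodes}
        for n in nodes:
            successors = out_links.get(n, [])
            if successors:
                share = damping * rank[n] / len(successors)
                for m in successors:
                    nxt[m] += share
            else:
                # Assumption: dangling nodes send their mass back to the query.
                for m, p in teleport.items():
                    nxt[m] += damping * rank[n] * p
        rank = nxt
    return rank


# Toy undirected chain a - b - c - d, with the walk personalized on "a".
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ppv = personalized_pagerank(chain, {"a": 1.0})
```

On this toy chain the probability mass concentrates around the query
node and decays with distance, which is the bias towards the vicinity
of the query described above.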

4.1 Random walks on Wikipedia 

Given a dictionary and graph derived from Wikipedia (cf. Sect. 3), PPR
expects a set of mentions, i.e., a set of strings which can be linked
to Wikipedia articles via the dictionary. The method first initializes
the teleport vector: for each mention in the input, the articles in the
respective dictionary entry are set with an initial probability, and
the rest of the articles are set to zero. We explored two options to
set the initial probability of each article: the uniform probability or
the prior probability in the dictionary. When an article appears in the
dictionary entry for two mentions, the initial probability is summed
up. In a second step, we apply PPR for a number of iterations,
producing a probability distribution over Wikipedia articles in the
form of a PPR vector (PPV).

6 http://ixa2.si.ehu.es/ukb
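
The initialization of the teleport vector described in Sect. 4.1 can be
sketched in Python as follows. The function name `init_teleport`, the
interpretation of "uniform" as uniform within each dictionary entry,
and the final normalization to a probability distribution are
assumptions of this sketch; the toy priors are invented.

```python
def init_teleport(mentions, dictionary, use_prior=True):
    """Build a teleport vector from the dictionary entries of the mentions.

    dictionary: mention -> {article: prior probability}.
    """
    weights = {}
    for mention in mentions:
        entry = dictionary.get(mention, {})
        for article, prior in entry.items():
            w = prior if use_prior else 1.0 / len(entry)
            # An article reachable from two mentions accumulates both weights.
            weights[article] = weights.get(article, 0.0) + w
    total = sum(weights.values())
    # Assumption: renormalize so that the vector sums to one.
    return {a: w / total for a, w in weights.items()} if total else {}


# Hypothetical dictionary entries for two mentions sharing one article.
toy_entries = {
    "drink": {"Drink": 0.8, "Alcoholic_beverage": 0.2},
    "alcohol": {"Alcohol": 0.6, "Alcoholic_beverage": 0.4},
}
tv_prior = init_teleport(["drink", "alcohol"], toy_entries, use_prior=True)
tv_uniform = init_teleport(["drink", "alcohol"], toy_entries, use_prior=False)
```

All articles that do not appear in any entry are implicitly zero, since
they are simply absent from the returned mapping.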

The probability vector can be used for both relatedness and NED. For
relatedness we produce a PPV vector for each of the words to be
compared, using the single word as input mention. The relatedness
between the target words is computed as the cosine between the
respective PPV vectors. In order to speed up the computation, we can
reduce the size of the PPV vectors, setting to zero all values below
rank k after ordering the values in decreasing order.

Table 3 shows the top 5 articles in the PPV vectors of two sample
words. The relatedness between the pair Drink and Alcohol would be
non-zero, as their respective vectors contain common articles.
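
The relatedness computation (truncation to the top k values followed
by the cosine) can be sketched as below, using the ten values shown in
Table 3 as toy sparse vectors. The function names are hypothetical, and
real PPV vectors contain many more entries than the five listed per
word.

```python
import math


def truncate(ppv, k):
    """Keep the k highest-valued entries; the rest are treated as zero."""
    top = sorted(ppv.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


def cosine(u, v):
    """Cosine between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(a, 0.0) for a, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Top-5 entries from Table 3.
drink = {"Drink": .124, "Alcoholic_beverage": .036, "Drinking": .028,
         "Coffee": .020, "Tea": .017}
alcohol = {"Alcohol": .145, "Alcoholic_beverage": .026, "Ethanol": .018,
           "Alkene": .006, "Alcoholism": .006}
score = cosine(truncate(drink, 5), truncate(alcohol, 5))
```

With only these ten values, the single shared article
(Alcoholic_beverage) is what makes the cosine non-zero.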

For NED the input comprises the target entity mention and its context,
defined as the set of mentions occurring within a 101 token window
centered on the target. In order to extract mentions of articles in
Wikipedia from the context, we match the longest strings in our
dictionary as we scan tokens from left to right. We then initialize the
teleport probability with all articles referred to by the mentions.
After computing Personalized PageRank, we output the article with the
highest rank in the PPV among the possible articles for the target
entity mention. Figure 1 shows an example of NED.
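
The context extraction step can be illustrated with a greedy
left-to-right longest match, sketched below. The function names, the
`max_len` cap on mention length and the whitespace tokenization are
assumptions of this sketch, not details reported in the paper.

```python
def context_window(tokens, target_index, size=101):
    """Return the tokens in a window of `size` tokens centered on the target."""
    half = size // 2
    return tokens[max(0, target_index - half):target_index + half + 1]


def spot_mentions(tokens, dictionary, max_len=5):
    """Greedy left-to-right longest match of token spans against the dictionary."""
    mentions, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in dictionary:
                mentions.append(candidate)
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return mentions


# Toy dictionary keys (hypothetical); "cape town" should win over "cape".
toy_keys = {"fletcher", "cape", "cape town", "lions"}
tokens = "had discussions with Fletcher in Cape Town".split()
found = spot_mentions(tokens, toy_keys)
```

The mentions found this way would then be used to initialize the
teleport vector for the PPR computation.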

If the prior is being used to initialize weights, we multiply the prior
probability with the PageRank probabilities before computing the final
ranks. In the rare cases⁷ where no known mention is found in the
context, we return the node with the highest prior.
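
The final selection step, including the multiplication by the prior and
the fallback for empty contexts, can be sketched as follows. The
function name and its arguments are hypothetical, and the priors and
PPR values in the toy example are invented for illustration.

```python
def disambiguate(candidates, ppv, use_prior=True, context_found=True):
    """Pick one article among the candidates of the target mention.

    candidates: article -> prior probability (dictionary entry of the mention).
    ppv:        article -> PPR probability computed from the context.
    """
    if not candidates:
        return None
    if not context_found:
        # Fallback: no known mention in the context, use the prior alone.
        return max(candidates, key=candidates.get)

    def score(article):
        s = ppv.get(article, 0.0)
        return s * candidates[article] if use_prior else s

    return max(candidates, key=score)


# Invented values: the prior prefers one team, the context the other.
lions = {"B&I_Lions": 0.6, "Highveld_Lions": 0.4}
toy_ppv = {"B&I_Lions": 0.05, "Highveld_Lions": 0.75}
with_context = disambiguate(lions, toy_ppv)
without_context = disambiguate(lions, toy_ppv, context_found=False)
```

In this toy case the context evidence outweighs the prior, while the
fallback simply returns the article with the highest prior.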

Note that our NED and relatedness algorithms are related. NED uses
relatedness, as PageRank probabilities capture how related each
candidate article is to the context of the mention. Following the
first-order and second-order co-occurrence abstraction (Islam and
Inkpen, 2006; Agirre and Edmonds, 2007, Ch. 6), we can interpret that
we do NED using first-order relatedness, while our relatedness uses
second-order relatedness.

7 Less than 3% of instances.


1. Graphs in Table 1 (default: Hr)

2. Number of iterations in PageRank:
   i ∈ {1, 2, 3, 4, 5, 10, 15, ..., 50} (default: 30)

3. Damping factor in PageRank:
   α ∈ {0.75, 0.8, 0.85, 0.90, 0.95, 0.99} (default: 0.85)

4. Initializing with prior or not (P or ¬P) (default: P)

5. Relatedness: number of values in PPV:
   k ∈ {100, 200, 500, 1000, 2000, 5000, 10000} (default: 5000)

Figure 2: Summary of variants and parameters as well as the default
values for each of them.


Name     Reference                             #
RG       (Rubenstein and Goodenough, 1965)     65
MC       (Miller and Charles, 1991)            30
353      (Gabrilovich and Markovitch, 2007)    353
TSA      (Radinsky et al., 2011)               287
KORE     (Hoffart et al., 2012)                420

TAC09    (McNamee et al., 2010)                1675
TAC10    http://www.nist.gov/tac/              1020
TAC13    http://www.nist.gov/tac/              1183
AIDA     (Hoffart et al., 2011)                4401
KORE     (Hoffart et al., 2012)                143

Table 4: Summary of relatedness (top) and NED (bottom) datasets.
Rightmost column for number of instances.

Figure 2 summarizes all parameters mentioned so 
far, as well as their default values, which were set 
following previous work (Agirre et al., 2010; Agirre 
et al., 2014). 

5 Experimental methodology 

We summarize the datasets used in Table 4. RG, MC and 353 are the most
used relatedness datasets to date, with TSA and KORE being more recent
datasets where some top-ranking systems have been evaluated. Word
relatedness datasets were lemmatized and lowercased, except for KORE,
which is an entity relatedness dataset where the input comprises
article titles⁸. Following common practice, rank correlation (Spearman)
was used for evaluation.

8 We had to manually adjust the articles in KORE, as the exact title
depends on the Wikipedia version. We missed 3 for our 2013 version,
which could slightly degrade our results.


Regarding NED, the TAC Entity Linking competition is held annually. Due
to its popularity it is useful to set the state of the art. We selected
the datasets in 2009 and 2010, as they have been used to evaluate
several top ranking systems, as well as the 2013 dataset, which is the
most recent. In addition, we also provide results for AIDA, the largest
and only dataset providing annotations for all entities in the
documents, and KORE, a recent, very small dataset focusing on difficult
mentions and short contexts. Evaluation was performed using accuracy,
the ratio between correctly disambiguated instances and the total
number of instances that have a link to an entity in the knowledge
base⁹. Each dataset uses a different Wikipedia version, but fortunately
Wikipedia keeps redirects from older article titles to the new version.
As customary in the task, we automatically map the articles returned by
our system to the version used in the gold standard.

Following standard practice in NED, we do not evaluate mention
detection¹⁰, that is, the datasets already specify which are the target
mentions. Note that TAC provides so-called “queries” which can be
substrings of the full mention (e.g. “Smith” for a mention like “John
Smith”). Given a mention, we devised the following heuristics to
improve candidate generation: (1) remove any substring contained in
parentheses from the mention, then check the dictionary, (2) if not
found, remove “the” if it is the first token in the mention, then check
the dictionary, (3) if not found, remove the middle token if the
mention contains three tokens, then check the dictionary, (4) if not
found, search for a matching entity using the Wikipedia API¹¹. The
heuristics provide an improvement of around 4 points on development.
Later analysis showed that these heuristics seem to be only relevant on
the TAC datasets, because of the way the query strings are designed,
but not on AIDA or KORE.
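
The first three candidate generation heuristics can be sketched as a
cascade of dictionary lookups. The sketch below assumes that each step
is applied to the output of the previous one, which the description
does not state explicitly; the function name and toy dictionary keys
are hypothetical, and step (4), the Wikipedia API search, is
deliberately left out.

```python
import re


def resolve_mention(mention, dictionary):
    """Return the dictionary key matching the mention, or None if not found."""

    def lookup(text):
        key = " ".join(text.lower().split())
        return key if key in dictionary else None

    # (1) Remove any substring in parentheses, then check the dictionary.
    current = re.sub(r"\([^)]*\)", " ", mention)
    hit = lookup(current)
    if hit:
        return hit
    # (2) Remove a leading "the", then check the dictionary.
    tokens = current.split()
    if tokens and tokens[0].lower() == "the":
        tokens = tokens[1:]
        hit = lookup(" ".join(tokens))
        if hit:
            return hit
    # (3) Remove the middle token of a three-token mention, then check.
    if len(tokens) == 3:
        hit = lookup(f"{tokens[0]} {tokens[2]}")
        if hit:
            return hit
    # (4) The Wikipedia API search would go here; omitted in this sketch.
    return None


toy_keys = {"john smith", "new york times"}
first = resolve_mention("John Q. Smith", toy_keys)
second = resolve_mention("The New York Times (newspaper)", toy_keys)
```

Returning `None` after step (3) marks the point at which an external
search would be attempted.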

5.1 Development and test 

We wanted to follow a standard experimental design, with a clear
development/test split for each task. Unfortunately there is no
standard split in the literature, and the choice is difficult: the
development dataset should be representative enough to draw conclusions
on different alternatives and parameters, but at the same time the most
relevant datasets in the literature should be left for testing, in
order to have enough points for comparison. In addition, some recent
algorithms supposedly setting the state of the art are only tested on
newly produced datasets. Note also that relatedness datasets are small,
making it difficult to find statistically significant differences.

9 Corresponds to non-NIL accuracy at TAC-KBP (also called KB accuracy)
and Micro P@1.0 in (Hoffart et al., 2011).

10 See (Cornolti et al., 2013) for a framework to evaluate both
mention detection and disambiguation.

11 http://en.wikipedia.org/w/api.php

In order to strike a balance between the need for in-depth analysis and
fair comparison to previous results, we decided to focus on the two
oldest datasets from each task for development and analysis: RG for
relatedness and a subset of 200 polysemic instances from TAC09 for NED
(TAC09_200)¹². The rest will be used for test, where the parameters
have been set on development. Given the need for significant
conclusions, we re-checked the main conclusions drawn from development
data using the aggregation of all test datasets, but only after the
comparison to the state of the art had been performed. This way we
ensure both a fair comparison with the state of the art and a
well-grounded analysis.

We performed significance tests using Fisher’s z-transformation for
relatedness (Press et al., 2002, equation 14.5.10), and paired
bootstrap resampling for NED (Noreen, 1989), accepting differences with
p-value < 0.05. Given the small size of the datasets, when necessary,
we also report statistical significance when joining all datasets as
just mentioned.
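
Both tests can be sketched in a few lines of Python. The first function
follows the usual form of the Fisher z-transformation test for the
difference between two correlation coefficients; the second is one
common variant of paired bootstrap resampling over per-instance
correctness. The function names, the number of resamples and the
one-sided formulation of the bootstrap are assumptions of this sketch
and may differ from the exact procedure used in the experiments.

```python
import math
import random


def fisher_z_pvalue(r1, r2, n1, n2):
    """Two-sided p-value for the difference between two correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return math.erfc(abs(z1 - z2) / (math.sqrt(2.0) * se))


def paired_bootstrap_pvalue(correct_a, correct_b, samples=1000, seed=0):
    """Fraction of resamples in which system A does not beat system B.

    correct_a, correct_b: lists of 0/1 flags over the same instances.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / samples


# Example: two correlations measured on the 65 pairs of RG.
p_corr = fisher_z_pvalue(0.884, 0.434, 65, 65)
# Example: toy correctness flags for two systems on 200 instances.
p_acc = paired_bootstrap_pvalue([1] * 200, [0] * 200)
```

Under the 0.05 threshold stated above, a difference would be accepted
as significant when the returned value falls below 0.05.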

6 Studying the graph and parameters 

In this section we study the performance of the different graphs and
parameters on the two development datasets, RG and TAC09_200. The next
section reports the results on the test sets for the best parameters,
alongside state-of-the-art system results.

As mentioned in Sect. 4.1, PPR has several parameters and variants (cf.
Figure 2). We first exhaustively checked all possible combinations for
different graphs, with the rest of the parameters set to default
values. We then optimized each of the parameters in turn, seeking to
answer the following questions:

Which links help most? Table 1 shows the results for selected graphs.
The first seven rows present the results for each edge source in
isolation, both using directed and undirected edges. Categories and
infoboxes suffer from producing smaller graphs, with the hyperlinks
yielding the best results. The undirected versions improve over
directed links in all cases, with the use of reciprocal edges for
hyperlinks obtaining the best results overall (the graphs with
reciprocal edges for categories and infoboxes were too small and we
omit them). The trend is the same in both relatedness and NED,
highlighting the robustness of these results.

12 The dataset in http://ixa2.si.ehu.es/ukb/README.wiki.txt includes
the subset.


Graph    Param.     RG      Param.     TAC09_200
Hr       default    88.4    default    68.5
Hr       ¬P         87.0    ¬P         49.0†
Hr       α 0.85     88.4    α 0.85     68.5
Hr       i 30       88.4    i 15       68.5
Hr       k 5000     88.4    -          -

Table 5: Parameters: Summary of results on development data for
relatedness (RG, Spearman correlation) and NED (TAC09_200, accuracy)
for several parameters using the Hr graph. Parameters are set to
default values (see text) except for the one noted explicitly. † for
statistically significant differences with respect to default.

Regarding combined graphs, we report the most significant combinations.
The reciprocal graph of hyperlinks outperforms all combinations
(including the combinations which were omitted), showing that
categories and infoboxes do not help or even slightly degrade the
results. The differences are statistically significant (either on the
individual datasets or in the aggregation on all datasets) in all
cases, confirming that Hr is significantly better.

The degradation or lack of improvement when using infoboxes is
surprising. We hypothesized that it could be caused by non-reciprocal
links in HrIu. In fact, removing non-reciprocal links from HrIu
improved results slightly on NED, matching those of Hr. This lack of
improvement with infoboxes, even when removing non-reciprocal links,
can be explained by the fact that only 5% of reciprocal links in Iu are
not in Hr. It seems that this additional 5% is not helping in this
particular dataset. Regarding categories, the category structure is
mostly a tree, which is a structure where random walks do not seem to
be effective, as already observed in (Agirre et al., 2014) for WordNet.



Graph    Method           RG        TAC09_200
Hr       NGD              81.8‡     57.5†
Hr       PPR (1 iter.)    43.4†‡    60.5†‡
Hr       PPR (2 iter.)    78.3‡     66.0†‡
Hr       PPR default      88.4      68.5

Table 6: Results when using single links, compared to the use of the
full graph on development data. We reimplemented NGD. † for stat.
signif. difference with PPR. ‡ for stat. signif. using all datasets.


Graph    Method         Year    RG      TAC09_200
Hr       PPR default    2010    86.3    68.5
Hr       PPR default    2011    85.6    70.5
Hr       PPR default    2013    88.4    68.5

Table 7: PPR using different Wikipedia versions


Is initialization of random walks important? The second row in Table 5
reports the result when using uniform distributions when initializing
the random walks (instead of prior probabilities). The results degrade
in both datasets, the difference being significant only for NED. This
was later confirmed in the rest of the relatedness and NED datasets:
using prior probabilities for initialization improves results in all
cases, but it is only significant in NED datasets. These results show
that relatedness is less sensitive to changes in the distribution of
meanings, that is, using the more informative prior distributions of
meaning only improves results slightly. NED, on the contrary, is more
sensitive, as the distribution of senses dramatically affects
performance.

Is the value of α and i important? The best
α on both datasets was obtained with default values
(cf. Table 5), in agreement with related work using
WordNet (Agirre et al., 2010). The lowest numbers of
iterations at which convergence was obtained were 30
and 15, respectively, although as few as 5 iterations
yielded very similar performance (87.1 on relatedness,
68.0 on NED).
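
To make the role of the damping factor and the number of iterations concrete, here is a minimal power-iteration sketch of Personalized PageRank. It is not the implementation evaluated here: the value 0.85 is the customary PageRank damping factor and is an assumption (the default is not restated in this section), the cap of 30 iterations simply reuses the larger convergence point quoted above, and the toy graph is invented.

```python
def personalized_pagerank(graph, teleport, alpha=0.85, max_iter=30, tol=1e-6):
    """Power iteration for Personalized PageRank.

    graph    -- {node: [neighbour, ...]}; every neighbour must also be a key
    teleport -- {node: probability}, summing to 1, with all nodes in graph
    alpha    -- damping factor (0.85 is an assumed, conventional value)
    Returns (rank, iterations_used).
    """
    rank = {node: teleport.get(node, 0.0) for node in graph}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # A fraction (1 - alpha) of the mass always returns to the seeds.
        new_rank = {n: (1.0 - alpha) * teleport.get(n, 0.0) for n in graph}
        for node, neighbours in graph.items():
            if neighbours:
                share = alpha * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
            else:
                # Dangling node: send its mass back through the teleport vector.
                for target, prob in teleport.items():
                    new_rank[target] += alpha * rank[node] * prob
        change = sum(abs(new_rank[n] - rank[n]) for n in graph)
        rank = new_rank
        if change < tol:
            break
    return rank, iteration


# Toy reciprocal graph, invented for illustration.
toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
rank, used = personalized_pagerank(toy_graph, {"a": 1.0})
```

Lowering `max_iter` trades accuracy for speed, which is the trade-off behind the observation that 5 iterations already give similar scores.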

Is the size of the vector, k, important for
relatedness? The best performance was attained for the
default k, with minor variations for k > 1000.
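
A minimal sketch of what truncating the vectors to size k could look like. It assumes that relatedness is the cosine between the two PPR vectors after keeping only their k highest-scoring entries; this section does not restate how the vectors are compared, so both the cosine and the helper names are assumptions.

```python
import heapq
import math


def truncate(vector, k):
    """Keep the k highest-scoring entries of a {entity: score} vector."""
    return dict(heapq.nlargest(k, vector.items(), key=lambda item: item[1]))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v[entity] for entity, score in u.items() if entity in v)
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relatedness(vector_a, vector_b, k):
    """Cosine of the two vectors after truncating each to its top k entries."""
    return cosine(truncate(vector_a, k), truncate(vector_b, k))


# Toy PPR vectors, invented for illustration.
vec_a = {"x": 0.5, "y": 0.3, "z": 0.2}
vec_b = {"x": 0.4, "z": 0.1, "w": 0.5}
score = relatedness(vec_a, vec_b, k=2)
```

With k=2 only `x` survives in both toy vectors, so the low-mass entry `z` no longer contributes to the score.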

Is the full graph helping? When the PPR algorithm
does a single iteration, we can interpret that it
is ranking all entities using direct links. When doing
two iterations, we can loosely say that it is using
links at distance two, and so on. Table 6 shows that
PPR is able to benefit from the full graph well
beyond 2 iterations, especially in relatedness. These
results were confirmed on the full set of datasets, with
statistically significant differences in all cases.
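
The reading of "t iterations use links up to distance t" can be checked with a few lines of code. The sketch below only tracks which nodes can hold non-zero probability after a given number of propagation steps; the function name and the toy chain graph are assumptions for illustration.

```python
def reachable_after(graph, seeds, iterations):
    """Nodes that can hold probability mass after a number of propagation steps.

    Each step pushes mass one hyperlink further, so after t steps only nodes
    within t links of a seed can be non-zero (seeds stay non-zero because the
    walk keeps teleporting back to them).
    """
    reached = set(seeds)
    frontier = set(seeds)
    for _ in range(iterations):
        frontier = {n for node in frontier for n in graph.get(node, [])} - reached
        reached |= frontier
    return reached


# Toy chain m - a - b - c with reciprocal links, invented for illustration.
chain = {"m": ["a"], "a": ["m", "b"], "b": ["a", "c"], "c": ["b"]}
one_hop = reachable_after(chain, {"m"}, 1)
two_hops = reachable_after(chain, {"m"}, 2)
```

A candidate such as `c`, three links away from the seed, receives no mass at all after one or two iterations, which is why a candidate supported only by longer paths cannot win when the walk is cut short.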

In addition, we reimplemented the relatedness
and NED algorithms based on NGD over direct
links (Milne and Witten, 2008a; Milne and Witten,
2008b), which allows us to compare them to PPR under
the same experimental conditions. We first developed
the relatedness algorithm 13. Table 6 reports the best
variant, which outperforms the 0.64 on RG reported
in their paper. We followed a similar methodology
for NED 14. Table 6 shows the results for NGD,
which performs worse than PPR. This trend was
confirmed on the full set of datasets for relatedness and
NED, with statistical significance in all cases except
KORE, which is the smallest NED dataset. Figure 1
illustrates why the use of longer paths is beneficial.
In fact, NGD returns 0.14 for B&I_Lions and
0.13 for Highveld_Lions, but PPR correctly
returns 0.05 and 0.75, respectively.
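
For reference, a direct-link measure in the style of Normalized Google Distance, as commonly formulated for Wikipedia links, can be sketched as below. This is not the reimplementation evaluated above: the function name, the use of plain Python sets and the toy numbers are assumptions, and the variants explored in footnote 13 (inlinks versus outlinks, maximum pairwise relatedness) are not covered.

```python
import math


def ngd_relatedness(links_a, links_b, total_articles):
    """Link-based relatedness in the style of Normalized Google Distance.

    links_a, links_b -- sets of articles directly linked with each entity
    total_articles   -- number of articles in the whole collection
    """
    common = links_a & links_b
    if not common:
        return 0.0  # no shared direct links: treated as unrelated
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the size of both link sets")
    distance = (math.log(larger) - math.log(len(common))) / (
        math.log(total_articles) - math.log(smaller)
    )
    return max(0.0, 1.0 - distance)


# Toy link sets, invented for illustration.
score = ngd_relatedness({1, 2, 3, 4}, {3, 4, 5}, total_articles=1000)
```

The score depends only on articles linked directly with both entities, so two entities connected solely through longer paths get 0, whereas a random walk can still relate them.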

How important is the Wikipedia version? Table
7 shows that the versions we tested do not affect
the results dramatically, and that using the latest
version does not yield better results in NED. Perhaps
the larger size and number of hyperlinks of newer
versions would only affect new articles and rare
articles, but not the ones present in TAC09 200. We kept
using the 2013 version for testing.

What is the efficiency of the algorithm? The
initialization takes around 5 minutes 15, where most
of the time (4m50s) is spent loading the dictionary into
memory. Using a database instead, initialization
takes 10s. Memory requirements for Hr were

13 In order to replicate the NGD relatedness algorithm, we
checked the open source code available, exploring the use of
inlinks and outlinks and the use of maximum pairwise article
relatedness. We also realized that the use of priors
(“commonness” according to the terminology in the paper) was
hurting, so we dropped it. We checked both reciprocal and
unidirectional versions of the hyperlink graph, with better
results for the reciprocal graph.

14 We checked both reciprocal and undirected graphs with 
similar results, combined with prior (similar results), weighted 
terms in the context (with improvement) and checked the use 
of ambiguous mentions in the context (marginal improvement). 
Reported results correspond to reciprocal, combination with 
prior, weighting terms and using only monosemous mentions. 

15 Time measured on a single server with Xeon E7-4830 8-core
processors, 2130 MHz, 64 GB RAM.




System                              Source                      RG        353       TSA       MC        KORE

(Ponzetto and Strube, 2011)         Wiki11            c         75.0*
(Nastase and Strube, 2013)          Wiki13            ci        67.0
(Milne and Witten, 2013)            Wiki13            la        69.5r     59.7r     35.8r     77.2r     65.9r
(Yeh et al., 2009)                  Wiki09            g                   48.5
PPR default Hr                      Wiki13            g       0 88.4*   1 72.8    1 64.1    1 81.0    1 66.2

(Agirre et al., 2010)               WNet              g       1 86.2r     68.5      45.4r   1 85.2r
(Tsatsaronis et al., 2010)          WNet              g         86.1      61.0
(Navigli and Ponzetto, 2012b)       WNet+Wiki12 (cl)  g+CL                65.0              1 90.0
(Pilehvar et al., 2013)             WNet+Wiki13       g         86.8*
PPR default Hr                      Wiki13            g       0 88.4*   2 72.8    1 64.1    4 81.0    1 66.2
PPR default Hr                      WNet+Wiki13       g       0 91.8*   1 78.5    2 62.9    2 87.6    1 66.2

(Gabrilovich and Markovitch, 2007)  Wiki07            t         82.0      75.0      59.0      73.0
(Hoffart et al., 2012)              Wiki12            t                                               0 69.8*
(Yazdani and Popescu-Belis, 2013)   Freebase          gt                  70.0*
(Radinsky et al., 2011)             Time              C                 1 80.0    1 63.0
(Baroni et al., 2014)               Corpus            C         84.0*     71.0
(Agirre et al., 2009)               WNet+Corpus       Cg+SUP  0 96.0x     78.0x
(Milne and Witten, 2013)            Wiki13            la+SUP    83.5r     74.0x     52.8r     81.3r   1 66.5r
PPR default Hr                      WNet+Wiki13       g       0 91.8*   2 78.5    2 62.9    2 87.6    2 66.2


Table 8: Spearman results for relatedness systems. The source column includes codes for information used (t for
article text, l for direct hyperlinks, g for hyperlink graph, c for categories, i for infoboxes, a for anchor text) and other
information sources (CL for crosslingual links, C for corpora, SUP for supervised Machine Learning). The results
include the following codes: * for best reported result among several variants, x for cross-validation result, r for
third-party system run by us. We also include the rank of our PPR system in each group of rows, including the systems above
it (excluding * and x systems, which get rank 0 if they are top rank).


4.7 GB, down to 1.1 GB when using the database.
The main bottleneck of our system is the computation
of Personalized PageRank, with each iteration taking
around 0.60 seconds. We are currently checking fast
approximations for PageRank, and plan to improve
efficiency.

7 Comparison to related work 

In the previous section we presented several results
under the same experimental conditions. We now
use the graph and parametrization which yield the
best results on development (default parameters with
Hr). Comparison to the state of the art is complicated
by many systems reporting results on different
datasets, which causes the tables in this section
to be rather sparse. The comparison for relatedness
is straightforward but, in NED, it is not possible
to factor out the impact of the candidate generation
step. Given that our candidate generation
procedure is not particularly sophisticated, we do not
think this is a decisive factor in favour of our results.

Tables 8 and 9 report the results of the best systems
on both tasks. Given that several systems were
developed on test data, we also report our results on
RG and TAC2009, marking all such results (see the
table captions for details). We split the results in
both tables into three sets: top rows for systems using
link and graph information alone, middle rows
for link- and graph-based systems using WordNet
and/or Wikipedia, and bottom rows for more complex
systems. We report the results of our system
repeatedly in each set of rows, for easier comparison.
Our main focus is on the top rows, which show the
superiority of our results with respect to other systems
using Wikipedia links and graphs. The middle
and bottom rows show the relation to the state of the
art.

For easier exposition, we will examine the results
by row section simultaneously on relatedness and
NED. The top rows in Table 8 report four relatedness
systems which have already been presented in
Sect. 2, showing that our system is best in all five
datasets. Note that the (Milne and Witten, 2013) row
was obtained running their publicly available system



System                      Source                   TAC2009    TAC2010    TAC2013    AIDA       KORE50

MFS baseline                Wiki13        l          68.3       73.7       72.7       69.0       36.4
(Guo et al., 2011)          Wiki10        l        1 74.0       74.1
(Milne and Witten, 2013)    Wiki13        la         57.4r      58.5r      37.1r      56.0r      35.7r
(Garcia et al., 2014)       Wiki12        l                     76.6
PPR default Hr              Wiki13        g        0 78.8*    1 83.6     1 81.7     1 80.0     1 60.8

(Moro et al., 2014)         WNet+Wiki13   g+CL                                      1 82.1     1 71.5
PPR default Hr              Wiki13        g        0 78.8*    1 83.6     1 81.7     2 80.0     2 60.8

(Bunescu and Pasca, 2006)   Wiki11        tc       0 83.8ra*    68.4ra
(Cucerzan, 2007)            Wiki11        tc       0 83.5ra*    78.4ra                51.0ro
(Hachey et al., 2011)       Wiki11        tcg                   79.8*
(Hoffart et al., 2012)      Wiki12        t                                         0 81.8*    0 64.6*
(Hoffart et al., 2011)      Wiki11        tli+SUP                                   0 81.8*
(Milne and Witten, 2013)    Wiki13        la+SUP     57.5r      63.4r      40.0r      55.6r      37.1r
Best TAC KBP system         -             -        1 76.5       80.6       77.7
PPR default Hr              Wiki13        g        0 78.8*    1 83.6     1 81.7     2 80.0     2 60.8


Table 9: Accuracy of NED systems, using the same codes as in Table 8. Some early systems have been
re-implemented and tested by others: ra for (Hachey et al., 2012), ro for (Hoffart et al., 2011). We report the rank of our PPR
system in each group of rows, including systems above (excluding * systems, which get rank 0 if they are top rank).


with the supervised Machine Learning component
turned off (see below for the results using SUP). The
top rows of Table 9 report the most frequent sense (MFS)
baseline (as produced by our dictionary) and three
link-based systems (cf. Sect. 2), showing that our method
is best in all five datasets. These results show that the
use of the full graph as devised in this paper is a
winning strategy.

The relatedness results in the middle rows of Table 8
include several systems using WordNet and/or
Wikipedia (cf. Sect. 2), including the system in
(Agirre et al., 2010), which we ran out-of-the-box
with default values. To date, link-based systems
using WordNet had reported stronger results than
their counterparts on Wikipedia, but the table shows
that our Wikipedia-based results are the strongest on
all relatedness datasets but one (MC, the smallest
dataset, with only 30 pairs). In addition, the table
shows our results when combining random walks on
Wikipedia and WordNet 16, which yields improvements
in most datasets. In the counterpart for NED
in Table 9, Moro et al. (2014) outperform our system,
especially on the smaller KORE (143 instances),
but note that they use a richer graph which combines
WordNet, the English Wikipedia and hyperlinks
from other language Wikipedias.
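
Footnote 16 states that the Wikipedia and WordNet scores are combined by multiplication, and Table 8 reports Spearman correlations. The sketch below shows both steps on invented toy scores; the function names and all numbers are assumptions for illustration, and the toy result is not meant to suggest that the product always helps.

```python
def ranks(values):
    """Average ranks (1-based), giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            result[order[position]] = average
        start = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def combine(wiki_scores, wordnet_scores):
    """Multiply the two relatedness scores pair by pair."""
    return [w * n for w, n in zip(wiki_scores, wordnet_scores)]


# Toy gold judgements and system scores for four word pairs (invented).
gold = [4.0, 3.0, 2.0, 1.0]
wiki = [0.9, 0.4, 0.5, 0.1]
wordnet = [0.8, 0.7, 0.2, 0.3]
combined = combine(wiki, wordnet)
```

In this toy case each source misorders one pair, and the product happens to order all four pairs like the gold standard.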

Finally, the bottom rows in both tables report the 

16 We multiply the scores of PPR on Wikipedia and WordNet. 


best systems to date. For lack of space, we cannot
review systems not using Wikipedia links. Regarding
relatedness, we can see that our combination of
WordNet and Wikipedia would rank second in all
datasets, with only one single system (based on
corpora) beating our system in more than one dataset
(Radinsky et al., 2011). Regarding NED, our system
ranks first in the TAC datasets, including the best
systems that participated in the TAC competitions
(Varma et al., 2009; Lehmann et al., 2010; Cucerzan
and Sil, 2013), and second to (Moro et al., 2014) on
AIDA and KORE.

8 Conclusions and Future Work 

This work departs from previous work based on
Wikipedia and derived resources, as it focuses on a
single knowledge source (links in Wikipedia) with
a clear research objective: given a well-established
random walk algorithm, we explored which sources
of links and filtering methods are useful, contrasting
the use of the full graph with respect to using
just direct links. We follow a clear
development/test/analysis methodology, evaluating on an
extensive range of both relatedness and NED datasets.
All software and data are publicly available, with
instructions to obtain out-of-the-box replicability 17.

17 http://ixa2.si.ehu.es/ukb/README.wiki.txt



We show for the first time that random walks over
the full graph of links improve over direct links. We
studied several variations of sources of links, showing
that non-reciprocal links hurt and that the
contribution of the category structure and relations in
infoboxes is residual. This paper sets a new state of
the art for systems based on Wikipedia links on both
word relatedness and named-entity disambiguation
datasets. The results are close to those of the best
combined systems, which specialize in either
relatedness or disambiguation, and use several information
sources and/or supervised machine learning
techniques. This work shows that a careful analysis
of varieties of graphs using a well-known random
walk algorithm pays off more than most ad-hoc
algorithms proposed to date.

For the future, we would like to explore ways to
filter out uninformative hyperlinks, perhaps weighting
edges according to their relevance, and would also
like to speed up the random-walk computations.

This article showed the potential of the graph of
hyperlinks. We would like to explore combinations
with other sources of information and algorithms,
perhaps using supervised machine learning. For
relatedness, we already showed improvement when
combining with random walks over WordNet, but
would like to explore tighter integration (Pilehvar
et al., 2013). For NED, local methods (Ratinov et
al., 2011; Han and Sun, 2011), global optimization
strategies based on keyphrases in context like KORE
(Hoffart et al., 2012) and doing NED jointly with
word sense disambiguation (Moro et al., 2014) are all
complementary to our method and thus promising
directions.